Robust Near-Isometric Matching via Structured Learning of Graphical Models

Julian J. McAuley*, Tiberio S. Caetano, and Alexander J. Smola

September 21, 2008 



Abstract 

Models for near-rigid shape matching are typically based 
on distance-related features, in order to infer matches that 
are consistent with the isometric assumption. However, real 
shapes from image datasets, even when expected to be re- 
lated by "almost isometric" transformations, are actually 
subject not only to noise but also, to some limited degree, 
to variations in appearance and scale. In this paper, we in- 
troduce a graphical model that parameterises appearance, 
distance, and angle features and we learn all of the involved 
parameters via structured prediction. The outcome is a 
model for near-rigid shape matching which is robust in the 
sense that it is able to capture the possibly limited but still 
important scale and appearance variations. Our experimen- 
tal results reveal substantial improvements upon recent suc- 
cessful models, while maintaining similar running times. 

1 Introduction 

Matching shapes in images has many applications, includ- 
ing image retrieval, alignment, and registration [1,2,3,4]. 
Typically, matching is approached by selecting features for 
a set of landmark points in both images; a correspondence 
between the two is then chosen such that some distance 
measure between these features is minimised. A great deal 
of attention has been devoted to defining complex features 
which are robust to changes in rotation, scale etc. [5,6]. 1

An important class of matching problems is that of near- 
isometric shape matching. In this setting, it is assumed 
that shapes are defined up to an isometric transformation 
(allowing for some noise), and therefore distance features are 
typically used to encode the shape. Some traditional meth- 
ods for related settings focus on optimisation over the space 
of rigid transformations so as to minimise least-squares cri- 
teria [11, 12]. 

Recently, this class of problems has been approached 
from a different perspective, as direct optimisation over the 
space of correspondences [13]. Although apparently more 
expensive, there it is shown that the rigidity assumption 
imposes a convenient algebraic structure in the correspon- 

*The authors are with the Statistical Machine Learning Program
at NICTA, and the Research School of Information Sciences and En-

1 We restrict our attention to this type of approach, i.e. that of
matching landmarks between images. Some notable approaches deviate
from this norm - see (for example) [7,8,9,10].



dence space so as to allow for efficient algorithms (exact 
inference in chordal graphical models of small clique size). 
More recently, these methods have been made substantially 
faster [14]. The key idea in these methods is to explicitly 
encode rigidity constraints into a tractable graphical model 
whose MAP solution corresponds to the best match. However,
the main advantage of correspondence-based optimisation
over transformation-based optimisation, namely the
flexibility of encoding powerful local features, has not been
further explored in this framework.

Other lines of work that optimise directly over the corre- 
spondence space are those based on Graph Matching, which 
explicitly model all pairwise compatibilities and solve for the 
best match with some relaxation (since the Graph Match- 
ing problem is NP-hard for general pairwise compatibili- 
ties) [15,16,17]. Recently, it was shown both in [18] and 
in [19] that if some form of structured optimisation is used 
to optimise graph matching scores, relaxed quadratic assign- 
ment predictors can improve the power of pairwise features. 
The key idea in these methods is to learn the compatibility
scores for the graph matching objective function, therefore
enriching the representability of features. A downside of
these graph matching methods however is that they do not 
typically make explicit use of the geometry of the scene in 
order to improve computational efficiency and/or accuracy. 

In this paper, we combine these two lines of work into 
a single framework. We produce an exact, efficient model 
to solve near-isometric shape matching problems using not 
only isometry-invariant features, but also appearance and 
scale-invariant features, all encoded in a tractable graphical
model. By doing so we can learn via large-margin structured 
prediction the relative importances of variations in appear- 
ance and scale with regard to variations in shape per se. 
Therefore, even knowing that we are in a near-isometric 
setting, we will still capture the eventual variations in ap- 
pearance and scale into our matching criterion in order to 
produce a robust near-isometric matcher. In terms of learn- 
ing, we introduce a two-stage structured learning approach 
to address the speed and memory efficiency of this model. 

The remainder of this paper is structured as follows: in 
section 2, we give a brief introduction to shape matching 
(2.1), graphical models (2.2), and discriminative structured 
learning (2.3). In section 3, we present our model, and 
experiments follow in section 4. 



2 Background 

2.1 Shape Matching 

'Shape matching' can mean many different things, depending
on the precise type of query one is interested in. Here
we study the case of identifying an instance of a template
shape ($S \subseteq T$) in a target scene ($U$) [ ]. 2 We assume that
we know $S$, i.e. the points in the template that we want
to query in the scene. Typically both $T$ and $U$ correspond
to a set of 'landmark' points, taken from a pair of images
(common approaches include [6,20,21,22]).

For each point $t \in T$ and $u \in U$, a certain set of unary
features are extracted (here denoted by $\phi(t)$, $\phi(u)$), which
contain local information about the image at that point [5,6].
If $y : S \to U$ is a generic mapping representing a potential
match, the goal is then to find a mapping $\hat{y}$ which minimises
the aggregate distance between corresponding features, i.e.

$$\hat{y} = f(S, U) = \operatorname*{argmin}_{y} \sum_{i=1}^{|S|} c_1(s_i, y(s_i)) \qquad (1)$$

where

$$c_1(s_i, y(s_i)) = \|\phi(s_i) - \phi(y(s_i))\|_2^2 \qquad (2)$$

(here $\|\cdot\|_2$ denotes the $L_2$ norm). For injective $y$, eq. (1) is a
linear assignment problem, efficiently solvable in cubic time.
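As a rough illustration of eqs. (1) and (2), the sketch below evaluates the unary cost and finds the best injective mapping on a toy instance. It is only a sketch: the feature values are made up, the function names are hypothetical, and exhaustive search (exponential in $|S|$) stands in for a cubic-time linear assignment solver purely to keep the example self-contained.

```python
from itertools import permutations

def unary_cost(phi_s, phi_u):
    """Squared L2 distance between two feature vectors, as in eq. (2)."""
    return sum((a - b) ** 2 for a, b in zip(phi_s, phi_u))

def best_injective_match(template_feats, target_feats):
    """Minimise eq. (1) over injective mappings y: S -> U by exhaustive search.

    Brute force is used only for clarity; a real implementation would use a
    cubic-time linear assignment solver instead.
    """
    n_s, n_u = len(template_feats), len(target_feats)
    best_y, best_cost = None, float("inf")
    for y in permutations(range(n_u), n_s):  # y[i] = target index chosen for s_i
        cost = sum(unary_cost(template_feats[i], target_feats[y[i]])
                   for i in range(n_s))
        if cost < best_cost:
            best_y, best_cost = y, cost
    return best_y, best_cost

# Toy data (hypothetical): three template points, four target points.
S_feats = [(0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
U_feats = [(1.0, 0.1), (0.9, 1.0), (0.1, 1.0), (5.0, 5.0)]
y_hat, cost = best_injective_match(S_feats, U_feats)
print(y_hat, round(cost, 3))  # (2, 0, 1) 0.03
```

The fourth target point plays the role of an outlier: it is never selected because every template feature is far from it.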

In addition to unary or first-order features, pairwise or
second-order features can be induced from the locations of
the unary features. In this case eq. (1) is generalised to
minimise an aggregate distance between pairwise features,
i.e.

$$\hat{y} = \operatorname*{argmin}_{y} \left[ \sum_{i=1}^{|S|} c_1(s_i, y(s_i)) + \sum_{i=1}^{|S|} \sum_{j=1}^{|S|} c_2(s_i, s_j, y(s_i), y(s_j)) \right]. \qquad (3)$$

This however induces an NP-hard problem for general $c_2$
(quadratic assignment). Discriminative structured learning
has recently been applied to models of both linear and
quadratic assignment (eq. (1) and eq. (3)) in [18]. Here
we exploit the structure of $c_2$ that arises from the
near-isometric shape matching problem in order to make such a
problem tractable.

2.2 Graphical Models 

In isometric matching settings, one may suspect that it 
may not be necessary to include all pairwise relations in 
quadratic assignment. In fact a recent paper [ ] has shown 
that if only the distances as encoded by the graphical model 
depicted in figure 1 (top) are taken into account (nodes represent
points in S and states represent points in U), exact
probabilistic inference in such a model can solve the isomet- 
ric problem optimally. That is, an energy function of the 

2 Here T is the set of all points in the template scene, whereas 
S corresponds to those points in which we are interested. It is also 
important to note that we treat S as an ordered object in our setting. 




Figure 1: Top: The graphical model introduced in [14]. 
Bottom: The clique-graph of this model.

following form is minimised: 3

$$\sum_{i=1}^{|S|} \left[ c_2(s_i, s_{i+1}, y(s_i), y(s_{i+1})) + c_2(s_i, s_{i+2}, y(s_i), y(s_{i+2})) \right]. \qquad (4)$$

Although the graphical model in figure 1 (top) does not 
form a single loop (a condition typically required for con- 
vergence of belief propagation [23,24,25]), [14] show that 
it is sufficient that the clique graph forms a single loop in 
order to guarantee convergence to the optimal assignment 
(figure 1, bottom). Furthermore, it is shown in [ ] that the 
number of iterations required before convergence is small in 
practice. 

We will extend this model by including a unary term,
$c_1(s_i, y(s_i))$ (as in eq. (1)), as well as a third-order term,
$c_3(s_i, s_{i+1}, s_{i+2}, y(s_i), y(s_{i+1}), y(s_{i+2}))$; the graph
topology remains the same. Note that in order to guarantee
convergence, we do not require any specific form for the potentials,
except that no assignment has infinite cost [14]. 
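To make the energy of eq. (4) concrete, the following sketch evaluates it for a given mapping, with indices taken modulo $|S|$ so that the points form a loop. The pairwise cost used here (a squared difference of Euclidean distances) is an assumption chosen for illustration, and the point sets are invented; the sketch only evaluates the energy and does not perform the belief-propagation inference described above.

```python
import math

def dist(a, b):
    """Euclidean distance between two 2-D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def c2(s_a, s_b, u_a, u_b):
    """Assumed pairwise cost: squared difference of the two distances."""
    return (dist(s_a, s_b) - dist(u_a, u_b)) ** 2

def ring_energy(S, U, y):
    """Energy of eq. (4): each node is linked to its neighbour and to its
    neighbour's neighbour, with indices interpreted modulo |S|."""
    n = len(S)
    total = 0.0
    for i in range(n):
        j, k = (i + 1) % n, (i + 2) % n
        total += c2(S[i], S[j], U[y[i]], U[y[j]])
        total += c2(S[i], S[k], U[y[i]], U[y[k]])
    return total

# Template: corners of a 2 x 1 rectangle (hypothetical data).
S = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (0.0, 1.0)]
# Target: the same points rotated by 90 degrees and translated (an isometry).
U = [(-py + 3.0, px + 1.0) for (px, py) in S]

correct = [0, 1, 2, 3]
swapped = [1, 0, 2, 3]
print(ring_energy(S, U, correct), ring_energy(S, U, swapped))
```

Under an exact isometry the correct mapping has zero energy, while swapping two assignments produces a strictly positive energy.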

2.3 Discriminative Structured Learning 

In practice, feature vectors may be very high-dimensional,
and which components are 'important' will depend on the
specific properties of the shapes being matched. Therefore,
we introduce a parameter, $\theta$, which controls the relative
importances of the various feature components. Note that
$\theta$ is parameterising the matching criterion itself. Hence our
optimisation problem becomes

$$\hat{y} = f(S, U; \theta) = \operatorname*{argmax}_{y} \langle h(S, U, y), \theta \rangle \qquad (5)$$

where

$$h(S, U, y) = -\sum_{i=1}^{|S|} \Phi(s_i, s_{i+1}, s_{i+2}, y(s_i), y(s_{i+1}), y(s_{i+2})) \qquad (6)$$

($y$ is a mapping from $S$ to $U$, $\Phi$ is a third-order feature vector
- our specific choice is shown in section 3). 4 In order to
measure the performance of a particular weight vector, we use
a loss function, $\Delta(\hat{y}, y^i)$, which represents the cost incurred
by choosing the assignment $\hat{y}$ when the correct assignment is
$y^i$ (our specific choice of loss function is described in section
4). To avoid overfitting, we also desire that $\theta$ is sufficiently
'smooth'. Typically, one uses the squared $L_2$ norm, $\|\theta\|_2^2$,
to penalise non-smooth choices of $\theta$ [26].

Learning in this setting now becomes a matter of choosing
$\theta$ such that the empirical risk (average loss on all training
instances) is minimised, but which is also sufficiently
'smooth' (to prevent overfitting). Specifically, if we have
a set of training pairs, $\{(S^1, U^1), \ldots, (S^N, U^N)\}$, with
labelled matches $\{y^1, \ldots, y^N\}$, then we wish to minimise

$$\underbrace{\frac{1}{N} \sum_{i=1}^{N} \Delta(f(S^i, U^i; \theta), y^i)}_{\text{empirical risk}} + \underbrace{\frac{\lambda}{2} \|\theta\|_2^2}_{\text{regulariser}}. \qquad (7)$$

Here $\lambda$ (the regularisation constant) controls the relative
importance of minimising the empirical risk against the
regulariser. In our case, we simply choose $\lambda$ such that the
empirical risk on our validation set is minimised.

Solving (eq. 7) exactly is an extremely difficult problem
and in practice is not feasible, since the loss is piecewise
constant on the parameter $\theta$. Here we capitalise on recent
advances in large-margin structured estimation [26],
which consist of obtaining convex relaxations of this problem.
Without going into the details of the solution (see, for
example, [26,27]), it can be shown that a convex relaxation
of this problem can be obtained, which is given by

$$\min_{\theta, \xi} \; \frac{1}{N} \sum_{i=1}^{N} \xi_i + \frac{\lambda}{2} \|\theta\|_2^2 \qquad (8a)$$

subject to

$$\langle h(S^i, U^i, y^i) - h(S^i, U^i, y), \theta \rangle \geq \Delta(y, y^i) - \xi_i \quad \text{for all } i \text{ and } y \in \mathcal{Y} \qquad (8b)$$

(where $\mathcal{Y}$ is the space of all possible mappings). It can be
shown that for the solution of the above problem, we have
that $\xi_i^* \geq \Delta(f(S^i, U^i; \theta), y^i)$. This means that we end up
minimising an upper bound on the loss, instead of the loss
itself.

3 $s_{i+1}$ should be interpreted as $s_{(i+1) \bmod |S|}$ (i.e. the points
form a loop).

4 We have expressed (eq. 5) as a maximisation problem as a matter
of convention; this is achieved simply by negating the cost function in
(eq. 6).

Solving (8) requires only that we are able, for any value
of $\theta$, to find

$$\operatorname*{argmax}_{y} \left\{ \langle h(S^i, U^i, y), \theta \rangle + \Delta(y, y^i) \right\}. \qquad (9)$$

In other words, for each value of $\theta$, we are able to identify
the mapping which is consistent with the model (eq. 5),
yet incurs a high loss. This process is known as 'column
generation' [ , ]. As we will define our loss as a sum over
the nodes, solving (eq. 9) is no more difficult than solving
(eq. 5).
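The relationship between prediction (eq. 5), loss-augmented inference (eq. 9) and the slack in (8b) can be sketched on a tiny example. This is a minimal sketch under stated assumptions: the model is reduced to unary features only, the data and the names `score` and `argmax_over_mappings` are hypothetical, and exhaustive enumeration replaces the graphical-model inference used in the paper.

```python
from itertools import product

def hamming_loss(y, y_true):
    """Normalised Hamming loss: fraction of nodes assigned to the wrong state."""
    return sum(a != b for a, b in zip(y, y_true)) / len(y_true)

def score(theta, feats, y):
    """<h, theta> for a toy unary model, with h = -sum_i feats[i][y[i]]."""
    return -sum(t * f
                for i, state in enumerate(y)
                for t, f in zip(theta, feats[i][state]))

def argmax_over_mappings(objective, n_nodes, n_states):
    """Exhaustive argmax over all mappings (injectivity is not enforced here)."""
    return max(product(range(n_states), repeat=n_nodes), key=objective)

# Hypothetical data: 2 template nodes, 2 target states, 1-dimensional features.
feats = [[(0.0,), (0.2,)], [(0.5,), (0.1,)]]
theta = (1.0,)
y_true = (0, 1)

# Prediction, as in eq. (5).
y_pred = argmax_over_mappings(lambda y: score(theta, feats, y), 2, 2)
# Loss-augmented inference, as in eq. (9): high score *and* high loss.
y_aug = argmax_over_mappings(
    lambda y: score(theta, feats, y) + hamming_loss(y, y_true), 2, 2)

# Smallest slack satisfying (8b); it upper-bounds the loss of the prediction.
xi = (score(theta, feats, y_aug) + hamming_loss(y_aug, y_true)
      - score(theta, feats, y_true))
print(y_pred, y_aug, round(xi, 3))  # (0, 1) (1, 0) 0.4
```

Because the Hamming loss is a sum over nodes, it can be folded into the unary terms, which is why the loss-augmented problem has the same structure as the prediction problem.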

3 Our Model 

Although the model of [14] solves isometric matching problems
optimally, it provides no guarantees for near-isometric
problems, as it only considers those compatibilities which 
form cliques in our graphical model. However, we are often 
only interested in the boundary of the object: if we look 
at the instance of the model depicted in figure 2, it seems 
to capture exactly the important dependencies; adding ad- 
ditional dependencies between distant points (such as the 
duck's tail and head) would be unlikely to contribute to 
this model. 

With this in mind, we introduce three new features (for
brevity we use the shorthand $y_i = y(s_i)$):

$\Phi_1(s_1, s_2, y_1, y_2) = (d_1(s_1, s_2) - d_1(y_1, y_2))^2$,
where $d_1(a, b)$ is the Euclidean distance between $a$ and
$b$, scaled according to the width of the target scene.

$\Phi_2(s_1, s_2, s_3, y_1, y_2, y_3) = (d_2(s_1, s_2, s_3) - d_2(y_1, y_2, y_3))^2$,
where $d_2(a, b, c)$ is the Euclidean distance between $a$
and $b$ scaled by the average of the distances between $a$,
$b$, and $c$.

$\Phi_3(s_1, s_2, s_3, y_1, y_2, y_3) = (\angle(s_1, s_2, s_3) - \angle(y_1, y_2, y_3))^2$,
where $\angle(a, b, c)$ is the angle between $a$ and $c$, w.r.t. $b$. 5

We also include the unary features $\Phi_0(s_1, y_1) = (\phi(s_1) - \phi(y_1))^2$
(i.e. the pointwise squared difference between $\phi(s_1)$
and $\phi(y_1)$). $\Phi_1$ is exactly the feature used in [ ], and is
invariant to isometric transformations (rotation, reflection,
and translation); $\Phi_2$ and $\Phi_3$ capture triangle similarity, and
are thus also invariant to scale. In the context of (eq. 6), we
have

$$\Phi(s_1, s_2, s_3, y_1, y_2, y_3) := \big[ \Phi_0(s_1, y_1),\; \Phi_1(s_1, s_2, y_1, y_2) + \Phi_1(s_1, s_3, y_1, y_3),\; \Phi_2(s_1, s_2, s_3, y_1, y_2, y_3) + \Phi_2(s_1, s_3, s_2, y_1, y_3, y_2),\; \Phi_3(s_1, s_2, s_3, y_1, y_2, y_3) \big]. \qquad (10)$$

This demands some explanation: only two pairwise dependencies
($\Phi_1$) are included in each clique - this is done to
ensure that each pairwise dependency is included exactly


5 Using features of such different scales can be an issue for
regularisation - in practice we adjusted these features to have roughly the
same scale. For full details, our implementation is available at (not
included for blind review).

Figure 2: Left: the (ordered) set of points in our template shape ($S$). Centre: connections between immediate neighbours.
Right: connections between neighbour's neighbours (our graphical model).



once, as the remaining dependency is captured by an adjacent
clique. Furthermore, we have included two scaled
distances and one angle ($\Phi_2$ and $\Phi_3$) - although we could
have included as many as three scaled distances and three
angles, we have instead included exactly what is required to
capture triangle similarity. Finally, we have enforced that
features of the same type are given the same weight ($\Phi_1$
and $\Phi_2$), simply by adding the different instances of these
features.
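The invariances claimed for the three higher-order features can be checked numerically. The sketch below is one possible reading of the definitions, not the authors' implementation: it assumes that $d_1$ divides both distances by the same target width, that the "average of the distances" in $d_2$ is the mean of the three pairwise distances, and that the angle is unsigned. All function names and the example triangle are hypothetical.

```python
import math

def dist(a, b):
    """Euclidean distance between 2-D points a and b."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def d1(a, b, width):
    """Distance scaled by the width of the target scene (assumed: division)."""
    return dist(a, b) / width

def d2(a, b, c):
    """Distance a-b scaled by the mean of the three pairwise distances (assumed)."""
    return dist(a, b) / ((dist(a, b) + dist(b, c) + dist(a, c)) / 3.0)

def angle(a, b, c):
    """Unsigned angle at b between the rays b->a and b->c, in radians."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    cross = v1[0] * v2[1] - v1[1] * v2[0]
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    return abs(math.atan2(cross, dot))

def phi1(s, y, width):
    return (d1(s[0], s[1], width) - d1(y[0], y[1], width)) ** 2

def phi2(s, y):
    return (d2(*s) - d2(*y)) ** 2

def phi3(s, y):
    return (angle(*s) - angle(*y)) ** 2

# A 3-4-5 triangle and a copy that is scaled by 2 and translated.
s = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]
y = [(2.0 * px + 1.0, 2.0 * py + 1.0) for px, py in s]

print(phi1(s, y, width=10.0), phi2(s, y), phi3(s, y))
```

On this example the distance feature responds to the change of scale, while the scaled-distance and angle features stay at (numerically) zero, which matches the statement that they capture triangle similarity.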

In practice, landmark detectors often identify several
hundred points [ , ], which is clearly impractical for an
$O(|S||U|^3)$ method ($|U|$ is the number of landmarks in the
target scene). To address this, we adopt a two stage learning
approach: in the first stage, we learn only unary
compatibilities, exactly as is done in [ ]. During the second stage
of learning, we collapse the first-order feature vector into a
single term, namely

$$\Phi'_0(s_1, y_1) = \langle \theta_0, \Phi_0(s_1, y_1) \rangle \qquad (11)$$

($\theta_0$ is the weight vector learned during the first stage). We
now perform learning for the third-order model, but consider
only the $p$ 'most likely' matches for each node, where
the likelihood is simply determined using $\Phi'_0(s_1, y_1)$. This
reduces the memory and runtime requirements to $O(|S|p^3)$.
A consequence of using this approach is that we must now
tune two regularisation constants; this is not an issue in
practice, as learning can be performed quickly using this
approach. 6
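The candidate-pruning step can be sketched as follows. This is an illustrative sketch only: the weights and feature values are invented, the function names are hypothetical, and 'most likely' is interpreted as the lowest collapsed cost of eq. (11), which is consistent with the negated cost in eq. (6) but is still an assumption about the implementation.

```python
def collapsed_unary(theta0, phi0):
    """Eq. (11): inner product of the first-stage weights with unary features."""
    return sum(t * f for t, f in zip(theta0, phi0))

def top_p_candidates(theta0, unary_feats, p):
    """For each template node, keep the p target indices with the lowest
    collapsed cost. unary_feats[i][u] is the feature vector Phi_0(s_i, u)."""
    candidates = []
    for feats_i in unary_feats:
        costs = [collapsed_unary(theta0, f) for f in feats_i]
        order = sorted(range(len(costs)), key=costs.__getitem__)
        candidates.append(order[:p])
    return candidates

# Hypothetical first-stage weights and features: 2 template nodes, 4 targets.
theta0 = (1.0, 0.5)
unary_feats = [
    [(0.0, 0.2), (1.0, 0.0), (0.1, 0.1), (2.0, 2.0)],
    [(1.0, 1.0), (0.0, 0.0), (0.5, 0.0), (0.2, 0.8)],
]
cands = top_p_candidates(theta0, unary_feats, p=2)
print(cands)  # [[0, 2], [1, 2]]

# Size of the clique tables before and after pruning: |S||U|^3 vs |S|p^3.
n_nodes, n_targets, p = 2, 4, 2
print(n_nodes * n_targets ** 3, n_nodes * p ** 3)  # 128 16
```

Each node then carries its own list of $p$ candidate states, so the third-order potentials are only evaluated on triples drawn from these lists.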



4 Experiments 
4.1 House Data 

In our first experiment, we compare our method to those 
of [ ] and [ ]. Both papers report the performance of 
their methods on the CMU 'house' sequence - a sequence 
of 111 frames of a toy house, with 30 landmarks identified 
in each frame. 7 As in [18], we compute the Shape Context 
features for each of the 30 points [ ] . 



6 In fact, even in those cases where a single stage approach was 
tractable (such as the experiment in section 4.1), we found that the 
two stage approach worked better. Typically, we required much less 
regularity during the second stage, possibly because the higher order 
features are heterogeneous. 

7 http://vasc.ri.cmu.edu/idb/html/motion/house/index.html



In addition to the unary model of [ ] , a model based on 
quadratic assignment is also presented, in which pairwise 
features are determined using the adjacency structure of 
the graphs. Specifically, if a pair of points $(p_1, p_2)$ in the
template scene is to be matched to $(q_1, q_2)$ in the target,
there is a feature which is 1 if there is an edge between $p_1$
and $p_2$ in the template, and an edge between $q_1$ and $q_2$ in the
target (and 0 otherwise). We also use such a feature for this
experiment, however our model only considers matchings
for which $(p_1, p_2)$ forms an edge in our graphical model (see
figure 3, bottom left). The adjacency structure of the graphs 
is determined using the Delaunay triangulation, (figure 3, 
top left). 
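The binary adjacency feature described above can be written in a few lines. In this sketch the edge sets are given by hand as hypothetical data; in the experiment they come from a Delaunay triangulation, which is not computed here.

```python
def adjacency_feature(p1, p2, q1, q2, template_edges, target_edges):
    """1 if (p1, p2) is a template edge and (q1, q2) is a target edge, else 0.

    Edges are undirected, so they are stored as frozensets of point indices.
    """
    in_template = frozenset((p1, p2)) in template_edges
    in_target = frozenset((q1, q2)) in target_edges
    return 1 if (in_template and in_target) else 0

# Hypothetical edge sets (in the paper these come from a Delaunay triangulation).
template_edges = {frozenset(e) for e in [(0, 1), (1, 2), (0, 2)]}
target_edges = {frozenset(e) for e in [(0, 1), (1, 3)]}

print(adjacency_feature(0, 1, 1, 0, template_edges, target_edges))  # 1
print(adjacency_feature(1, 2, 0, 3, template_edges, target_edges))  # 0
```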

As in [ ], we compare pairs of images with a fixed baseline
(separation between frames). For our loss function,
$\Delta(\hat{y}, y^i)$, we used the normalised Hamming loss, i.e. the
proportion of mismatches. Figure 4 shows our performance on
this dataset, as the baseline increases. On top we show
the performance without learning, for which our model
exhibits the best performance by a substantial margin. 8 Our
method is also the best performing after learning (figure
4 (bottom)); in fact, we achieve almost zero error for all
but the largest baselines (at which point our model assumptions
become increasingly violated, and we have less training
data).

In figure 5, we see that the running time of our method 
is similar to the quadratic assignment method of [ ]. To 
improve the running time, we also show our results with p = 
10, i.e. for each point in the template scene, we only consider 
the 10 'most likely' matches, using the weights from the first 
stage of learning. This reduces the running time by more 
than an order of magnitude, bringing it closer to that of 
linear assignment; even this model achieves approximately 
zero error up to a baseline of 60. 

Finally, figure 6 (top) shows the weight vector of our 
model, for a baseline of 70. The first 60 weights are for the 
Shape Context features (determined during the first stage of 
learning), and the final 5 show the weights from our second 
stage of learning (the weights correspond to the first-order 



8 Interestingly, the quadratic method of [ ] performs worse than
their unary method; this is likely because the relative scale of the
unary and quadratic features is badly tuned before learning, and is
indeed similar to what the authors report. Furthermore, the results
we present for the method of [18] after learning are much better than
what the authors report - in that paper, the unary features are scaled
using a pointwise exponent ($-\exp(-|\phi_a - \phi_b|^2)$), whereas we found
that scaling the features linearly ($|\phi_a - \phi_b|^2$) worked better.

Figure 3: Left: The adjacency structure of the graph (top); the boundary of our 'shape' (centre); the topology of our
graphical model (bottom). Right: Example matches using linear assignment (top, 6/30 mismatches), quadratic assignment
(centre, 3/30 mismatches), and the proposed model (bottom, no mismatches). The images shown are the 3rd and 93rd
frames in our sequence. Correct matches are shown in green, incorrect matches in red. All matches are reported after
learning.



features, distances, adjacencies, scaled distances, and an- 
gles, respectively - see section 3). We can provide some 
explanation of the learned weights: the Shape Context features
are separated into 5 radial, and 12 angular bins - the
fact that there is a peak for the 14th, 26th, and 38th features
indicates that a particular angular bin is more important
than the others; the fact that the final 12 features have low 
weight indicates that the most distant radial bin has little 
importance (etc.). It is much more difficult to reason about 
the second stage of learning, as the features have different 
scales, and cannot be compared directly - however, it ap- 
pears that all of the higher-order features are important to 
our model. 
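The reading of the first 60 weights can be made explicit with a small index calculation. The sketch assumes a radial-major layout (12 consecutive angular bins for each of the 5 radial bins); this ordering is inferred from the statements above about the peaks and the final 12 features, and the function name is hypothetical.

```python
N_RADIAL, N_ANGULAR = 5, 12

def shape_context_bin(k):
    """Map a 1-based feature index k (1..60) to 1-based (radial, angular) bins.

    Assumes radial-major ordering: 12 consecutive angular bins per radial bin.
    """
    if not 1 <= k <= N_RADIAL * N_ANGULAR:
        raise ValueError("feature index out of range")
    return (k - 1) // N_ANGULAR + 1, (k - 1) % N_ANGULAR + 1

peaks = [shape_context_bin(k) for k in (14, 26, 38)]
print(peaks)  # [(2, 2), (3, 2), (4, 2)]

last_block = {shape_context_bin(k)[0] for k in range(49, 61)}
print(last_block)  # {5}
```

Under this layout the three peaks fall in the same angular bin of consecutive radial bins, and the final 12 features all belong to the outermost radial bin.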

It is worth briefly mentioning that we also ran this experiment
using our model, but including only the adjacency
features, and ignoring all third-order features - i.e. replicating
exactly the experiment from [18], but including only the
limited dependencies captured by our model. In this experiment,
the model of [ ] performed better than ours; this
indicates that the benefit of using an exact algorithm does
not exceed the cost of capturing only limited dependencies.
Indeed, this indicates that the third-order features are playing
a very significant role in contributing to the performance
of our model.

4.2 Synthetic Data 

For this experiment, our 'shape' consists of 25 points ran- 
domly distributed on the silhouette of the painting shown 
in figure 7 (note that this shape exhibits less structure than 
those in our other experiments, due to the random ordering 



of the points). In addition to the points on our shape, a
number of outliers are randomly distributed on the silhouette.
10 images each for training, testing, and validation are then
generated by randomly perturbing the x- and y-coordinates
of these points by between $-\epsilon/2$ and $\epsilon/2$ pixels, where
$\epsilon$ ranges up to 20. This produces 45 pairs of
images for each of training, validation, and testing. This
experiment is aimed at examining the robustness of our approach
to noise and outliers, as well as the effect of choosing
different values for p. 9
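The data-generation procedure can be sketched as below. Several details are assumptions made for illustration: the perturbation is drawn uniformly from the stated interval, the base points are sampled in a 200 x 200 square rather than on the silhouette, outliers are omitted, and the seed and function name are arbitrary.

```python
import random
from itertools import combinations

def perturb(points, eps, rng):
    """Shift each coordinate by a uniform offset in [-eps/2, eps/2] (assumed)."""
    half = eps / 2.0
    return [(x + rng.uniform(-half, half), y + rng.uniform(-half, half))
            for x, y in points]

rng = random.Random(0)  # fixed seed, purely for reproducibility of the sketch
shape = [(rng.uniform(0.0, 200.0), rng.uniform(0.0, 200.0)) for _ in range(25)]

eps = 10.0
images = [perturb(shape, eps, rng) for _ in range(10)]

# Every unordered pair of the 10 images forms one matching problem.
pairs = list(combinations(range(len(images)), 2))
print(len(pairs))  # 45
```

The count of 45 is simply the number of unordered pairs among 10 images.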

The results of this experiment are shown in figure 8. Note 
that the 'point matching' method is only shown for zero 
outliers, as the method becomes intractable as \U\ increases. 
The quadratic assignment method of [ ] is not shown for 
this experiment, as the adjacency information in the graph 
is not robust to random error, or the addition of outliers 
(it performed far worse than the techniques shown). Since 
we cannot hope to get exact matches, we use the endpoint 
error instead of the normalised Hamming loss, i.e. we reward 
points which are close to the correct match. 10 Figure 8 
also examines the effect of choosing different values of p 
(the number of points considered during the second stage of
learning).
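For reference, the endpoint error used in this experiment (the average Euclidean distance from the correct label, scaled by the image width) can be computed as in the following sketch. The function name and the example coordinates are hypothetical.

```python
import math

def endpoint_error(pred_points, true_points, image_width):
    """Mean Euclidean distance between predicted and correct landmark
    locations, divided by the image width."""
    total = sum(math.hypot(p[0] - t[0], p[1] - t[1])
                for p, t in zip(pred_points, true_points))
    return total / (len(true_points) * image_width)

# One exact match and one match that is 5 pixels off, in a 100-pixel-wide image.
pred = [(0.0, 0.0), (3.0, 4.0)]
true = [(0.0, 0.0), (0.0, 0.0)]
err = endpoint_error(pred, true, image_width=100.0)
print(err)  # 0.025
```

Unlike the normalised Hamming loss, a near miss is penalised only in proportion to its distance from the correct point.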

Given that our datapoints are generated randomly, we 
observe little improvement from learning when using first- 
order features. Although the higher-order model provides 
no benefit when there are no outliers, it is highly beneficial 
once outliers are introduced; we also observe a significant 

9 Note that setting p = 1 essentially recovers the linear method
of [18].

10 Here the endpoint error is just the average Euclidean distance from
the correct label, scaled according to the width of the image.

Figure 8: Comparison of our technique against that of [ ] ('point matching'), and [18] ('linear'). Results are shown
for errors ($\epsilon$) ranging from 2 to 20 pixels, for 0 (top), 25 (middle), and 75 (bottom) outliers. Results before learning
are shown on the left, results after learning are shown on the right. Note the log-scale of the y-axis. Error bars indicate
standard error. In many plots, the performance is almost identical for different values of p.

exhibits the best performance both before and after learning
(note the different scales of the two plots). Error bars
indicate standard error.

Figure 6: Top: The weight vector of our method after learning,
for the 'house' data. The first 60 weights are for the
Shape Context features from the first stage of learning;
the final 5 weights are for the second stage of learning.
Bottom: The same plot, for the 'bikes' data.

Figure 7: The silhouette (from Steve Abbott's painting of
Fingask Castle) on which our points are distributed. The
'shape' of our model is shown in green, with outliers shown
in blue.

benefit from learning, which likely indicates that the rela- 
tive weights of the low and higher-order features are being 
adjusted. Finally, although we observe poor performance 
for p = 5, we observe almost no difference when increasing 
p from 10 to 20. 

4.3 Bikes Data 

For our final experiment, we used images of bicycles from 
the Caltech 256 Dataset [29]. Bicycles are reasonably rigid 
objects, meaning that matching based on their shape is log- 
ical. Although the images in this dataset are fairly well 
aligned, they are subject to reflections as well as some scal- 
ing and shear. For each image in the dataset, we detected 
landmarks automatically, and six points on the frame were 
hand-labelled (see figure 9). Only shapes in which these in- 
terest points were not occluded were used, and we only in- 
cluded images that had a background; in total, we labelled 
44 images. The first image was used as the 'template', the 
other 43 were used as targets. Thus we are learning to 
match bicycles similar to the chosen template. 

Initially, we used the SIFT landmarks and features as 
described in [ ]. Since this approach typically identifies sev- 
eral hundred landmarks, we set p = 20 for this experiment 
(i.e. we consider the 20 most likely points). Again we use 
the endpoint error for this experiment. Table 1 reveals that 
the performance of this method is quite poor, even with the 
higher-order model, and furthermore reveals no benefit from 
learning. This may be explained by the fact that although 
the SIFT features are invariant to scale and rotation, they 
are not invariant to reflection. 

In [28], the authors report that the SIFT features can 
provide good matches in such cases, as long as landmarks 
are chosen which are locally invariant to affine transformations.
They give a method for identifying affine-invariant
feature points, whose SIFT features are then computed. 11 
We achieve much better performance using this method, 
and also observe a significant improvement after learning. 
Figure 9 shows an example match using both the unary and 
higher-order techniques. 

Finally, figure 6 (bottom) shows the weights learned for this
model. Interestingly, the first-order term during the second 
stage of learning has almost zero weight. This must not 

11 We used publicly available implementations of both methods. 



be misinterpreted: during the second stage, the response 
of each of the 20 candidate points is so similar that the 
first-order features are simply unable to convey any new 
information - yet they are still very useful in determining 
the 20 candidate points. 

5 Discussion and Future Work 

While our model seems well motivated when applied to the 
problem of 'shape' matching (i.e. when the shape has a 
clearly defined boundary), we are clearly making a trade- 
off when applying our model to the more general problem 
of matching point-patterns. In such cases, we are at a dis- 
advantage due to the fact that we capture only a fraction 
of the desired dependencies, but we are at an advantage in 
that our model is exact, and also that it is able to cap- 
ture higher-order properties of the scene. Interestingly, we 
found that the exactness of our model alone does not make 
up for this limitation. This reveals the surprising result that 
the scale- invariant third-order features are able to capture 
a great deal of additional information that is not present at 
lower orders. 

A hurdle faced by our approach is that of occlusions (ei- 
ther due to the landmark detector failing to identify part of 
the shape, or simply due to part of the shape being missing 
from the scene). Occlusions are of little concern to a first-order
model, as an incorrect assignment to a single point
has no effect on other assignments, whereas they may
adversely affect our model, as the assignments are inextricably
linked. In this paper, we have effectively dealt with the first
issue (i.e. that of the landmark detector failing to identify 
an important point), by using learning to select candidate 
landmarks. Dealing with occlusions explicitly is an impor- 
tant future addition to our model. 

Another issue we encountered was that of feature scaling. 
For instance, suppose we express angles in degrees rather 
than radians; from the point of view of our model, this 
should make no difference - we would just scale the corresponding
weights by $\pi/180$; but from the point of view of
the regulariser, this is a very significant change - it is much
more 'expensive' to include a feature with a small scale (relative
to other features) than it is to include a feature with
a large scale. In theory, we would like to include many dif- 
ferent features, and have the learning algorithm separate 
the good from the bad; in practice, this was not possible, 
as we were forced to address the relative scale of our fea- 
tures before we were able to do learning. 12 This appears to 
be a fundamental issue when applying learning to models 
with heterogeneous features, for which we are not aware of 
a principled solution. 
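The scaling issue can be illustrated numerically. The sketch below assumes, for simplicity, a feature that is linear in the angle (for squared angle differences the factor would itself be squared); the weight and angle values are hypothetical.

```python
import math

scale = 180.0 / math.pi          # degrees per radian

angle_rad = 0.3                  # hypothetical feature value, in radians
theta_rad = 2.0                  # hypothetical weight for the radian feature

# The same feature expressed in degrees, with the weight rescaled by pi/180.
angle_deg = angle_rad * scale
theta_deg = theta_rad * math.pi / 180.0

# The model's score is unchanged by the change of units ...
contribution_rad = theta_rad * angle_rad
contribution_deg = theta_deg * angle_deg

# ... but the squared-norm penalty on this weight changes by (180/pi)^2.
penalty_ratio = theta_rad ** 2 / theta_deg ** 2
print(contribution_rad, contribution_deg, round(penalty_ratio, 1))
```

The two parameterisations make identical predictions, yet the regulariser charges roughly 3283 times more for the weight on the small-scale (radian) feature, which is why the relative scale of the features had to be addressed before learning.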

In this paper we have used 'off-the-shelf' landmark detectors,
and only applied learning after landmarks have been
detected. Since we know the 'type' of landmarks we want 
in advance (they are labelled in the template scene), it may 
be possible to apply learning to the landmark detector itself 
in order to further improve the performance of our model. 

12 For full details, our implementation is available at (our
implementation will be made available at the time of publication)

Figure 9: Top: A selection of our training images. Bottom: An example match from our test set. Left: The template
image (with the shape outlined in green, and landmark points marked in blue). Centre: The target image, and the match
(in red) using unary features with the affine invariant/SIFT model of [28] after learning (endpoint error = 0.34). Right:
the match using our model after learning (endpoint error = 0.04).



Table 1: Performance on the 'bikes' dataset. The endpoint error is reported, with standard errors in parentheses (note
that the second-last block, 'higher-order', uses the weights from the first stage of learning, but not the second).

                             SIFT [ ]         Affine invariant/SIFT [ ]
unary          training:     0.335 (0.038)    0.321 (0.018)
               validation:   0.346 (0.027)    0.337 (0.015)
               testing:      0.371 (0.011)    0.332 (0.024)
+ learning     training:     0.277 (0.024)    0.286 (0.024)
               validation:   0.325 (0.020)    0.300 (0.020)
               testing:      0.371 (0.011)    0.302 (0.016)
higher-order   training:     0.233 (0.047)    0.205 (0.043)
               validation:   0.223 (0.025)    0.254 (0.035)
               testing:      0.289 (0.045)    0.294 (0.034)
+ learning     training:     0.254 (0.046)    0.211 (0.036)
               validation:   0.224 (0.025)    0.234 (0.035)
               testing:      0.289 (0.045)    0.233 (0.034)

It would also be possible to allow for shapes which are
rigid in some parts, but less so in others. For instance, 
although the handlebars, wheels, and pedals appear in sim- 
ilar locations on all bicycles, the seat and crossbar do not; 
we could allow for this discrepancy by learning a separate 
weight vector for each clique. 

6 Conclusion 

We have presented a model for near-isometric shape match- 
ing which is robust to typical additional variations of the 
shape. This is achieved by performing structured learning 
in a graphical model that encodes features with several dif- 
ferent types of invariances, so that we can directly learn 
a "compound invariance" instead of taking for granted the 
exclusive assumption of isometric invariance. Our experi- 
ments revealed that structured learning with a principled 
graphical model that encodes both the rigid shape as well 
as non-isometric variations gives substantial improvements, 
while still maintaining competitive performance in terms of 
running time.